Abstract

The problem of extracting as much information as possible from a
sequence of observations of a stationary stochastic process $X_0, X_1, \ldots, X_n$
has been considered by many authors from different points of view.
It has long been known through the work of D. Bailey that no universal
estimator for $P(X_{n+1} \mid X_0, X_1, \ldots, X_n)$ can be found which
converges to the true conditional probability almost surely. Despite this result, for
restricted classes of processes, or for sequences of estimators along
stopping times, universal estimators can be found. We present here
a survey of some of the recent work that has been done along these
lines.
1 Introduction



In a short communication that appeared in the Proceedings of the First 
International IEEE-USSR Information Workshop [7], Tom Cover formulated 
a number of problems that have generated a substantial literature during 
the past thirty years. We plan to survey a portion of these works, biased to
be sure by our own interests. We begin by quoting from Cover's paper and
recalling his first two questions: 

" 1. A Question on the Prediction of Ergodic Processes 
The statement that "we can learn the statistics of an ergodic process from 
a sample function with probability 1" is being investigated for operational 
significance. 

Let $\{X_n\}_{-\infty}^{\infty}$ be a stationary binary ergodic process with conditional
probability distributions $p(x_{n+1} \mid x_n, \ldots, x_1)$, $n = 1, 2, \ldots$. We know that
we can learn the statistics with probability 1, but can we learn $p$ fast
enough? In other words, does there exist an estimate $\hat p : X \times X^* \to [0, 1]$,
$X^*$ = collection of all finite strings, for which

$$\hat p(X_{n+1} \mid X_n, \ldots, X_1) - p(X_{n+1} \mid X_n, \ldots, X_1) \to 0$$

with probability 1?

Does there also exist a predictor $\hat p$ yielding the convergence of

$$\hat p(X_0 \mid X_{-1}, X_{-2}, \ldots, X_{-n}) \to p(X_0 \mid X_{-1}, X_{-2}, \ldots)?$$

Since the statement of this problem, Bailey and Ornstein have obtained some 
as yet unpublished results on this question that indicate a negative answer 
to the first question and a positive answer to the second." 

Since the processes are stationary, the (second) backward prediction prob- 
lem is equivalent to the (first) forward prediction problem as far as conver- 
gence in probability is concerned. However, for almost sure results it turns 
out that they are far from being the same. Ornstein [30] gave a rather 
complicated algorithm for the backward prediction problem whereas Bailey 
provided a proof for the nonexistence of a universal algorithm guaranteeing
almost sure convergence in the forward estimation problem. To do this, Bailey
in [5], assuming the existence of a universal algorithm, used Ornstein's
technique of cutting and stacking [31] for the construction of a "counterexample"
process for which the algorithm fails to converge (see Shields [34] for
more details on this method).

The problem came to life again in the late eighties with the work of 
Ryabko [33]. He used a simpler technique, namely relabelling a countable
state Markov chain, in order to prove the nonexistence of a universal estimator
for Cover's first problem (cf. also Gyorfi, Morvai and Yakowitz [11]).
In addition there was a growing interest in universal algorithms of various 
kinds in information theory and elsewhere, see Feder and Merhav [10] for a 
survey. 

Three approaches evolved in an attempt to obtain positive results for the 
problem of forward estimation in the face of Bailey's theorem. 

The first modifies the almost sure convergence to convergence in prob- 
ability or almost sure convergence of the Cesaro averages. This was done 
already by Bailey in his thesis. Cf. Algoet [2, 3] and Weiss [36]. 

The second gives up on trying to estimate the distribution of the next 
output at all time moments n, and concentrates on guaranteeing prediction 
only at certain stopping times, while the third restricts the class of processes 
for which the scheme is shown to succeed. 

Our interest in this circle of ideas began with the PhD thesis of the first 
author [15] in which he gave an algorithm for the backward prediction that 
was much simpler than Ornstein's original scheme (cf. Morvai, Yakowitz and 
Gyorfi [27] ). Before describing briefly the contents of the survey we will 
present this scheme with a sketch of the proof of its validity. Let $\{X_n\}_{n=-\infty}^{\infty}$
be a stationary and ergodic time series taking values from $\mathcal{X} = \{0, 1\}$. (Note
that every stationary time series $\{X_n\}_{n=0}^{\infty}$ can be thought of as a two-sided
time series, that is, $\{X_n\}_{n=-\infty}^{\infty}$.) For notational convenience, let
$X_m^n = (X_m, \ldots, X_n)$, where $m \le n$.
Here is the algorithm. For $k = 1, 2, \ldots$, define sequences $\lambda_{k-1}$ and $\tau_k$
recursively. Set $\lambda_0 = 1$ and let $\tau_k$ be the time between the occurrence of the
pattern $X_{-\lambda_{k-1}}^{-1}$ at time $-1$ and the last occurrence of the same pattern prior
to time $-1$. Formally, let

$$\tau_k = \min\{t > 0 : X_{-\lambda_{k-1}-t}^{-1-t} = X_{-\lambda_{k-1}}^{-1}\}.$$

Put

$$\lambda_k = \tau_k + \lambda_{k-1},$$

where $\lambda_k$ is the length of the pattern $X_{-\lambda_k}^{-1}$.

The observed vector $X_{-\lambda_{k-1}}^{-1}$ almost surely takes a value of positive
probability; thus by stationarity, the string $X_{-\lambda_{k-1}}^{-1}$ must appear in the sequence
$X_{-\infty}^{-2}$ almost surely. One denotes the $k$th estimate of $P(X_0 = 1 \mid X_{-\infty}^{-1})$ by $P_k$,
and defines it to be

$$P_k = \frac{1}{k} \sum_{j=1}^{k} X_{-\tau_j}.$$

As in Ornstein [30], the estimate $P_k$ is calculated from observations of random
size. Here the random sample size is $\lambda_k$. To obtain a fixed sample-size
$0 < t < \infty$ version, we apply the same method as in Algoet [1], that is, let
$\kappa_t$ be the maximum of integers $k$ for which $\lambda_k \le t$. Formally,

$$\kappa_t = \max\{k \ge 0 : \lambda_k \le t\}.$$

Now put

$$\hat P_{-t} = P_{\kappa_t}.$$
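The recursion above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' code: the function name `backward_estimate` and the list layout (the last entry of `past` is $X_{-1}$, the one before it $X_{-2}$, and so on) are hypothetical choices. The function keeps extending the pattern as long as a previous occurrence fits inside the available sample, and returns the last computable $P_k$ together with $k$.

```python
def backward_estimate(past):
    """Return (P_k, k) for the largest k computable from the finite past.

    past[-1] is X_{-1}, past[-2] is X_{-2}, ... (hypothetical layout).
    Returns (None, 0) if not even tau_1 can be found in the sample.
    """
    past = list(past)
    t = len(past)
    lam = 1                       # lambda_0 = 1
    samples = []                  # the values X_{-tau_j}, j = 1, 2, ...
    while True:
        pattern = past[t - lam:]  # X_{-lambda_{k-1}}^{-1}
        tau = None
        # shift the window back until the pattern reappears inside the sample
        for shift in range(1, t - lam + 1):
            if past[t - lam - shift:t - shift] == pattern:
                tau = shift
                break
        if tau is None:           # no earlier occurrence within the data
            break
        samples.append(past[t - tau])   # X_{-tau_k} follows the earlier occurrence
        lam += tau                      # lambda_k = tau_k + lambda_{k-1}
    k = len(samples)
    return (sum(samples) / k if k else None), k


# Alternating past ...0,1,0,1 (X_{-1} = 1): the next symbol is always 0.
print(backward_estimate([0, 1] * 4))
```

For the alternating past the symbol following each earlier occurrence of the pattern is 0, so every collected sample is 0 and the estimate of $P(X_0 = 1 \mid \text{past})$ is 0.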

The following theorem was established in the PhD thesis of Morvai [15]. 

Theorem 1.1 (Morvai [15]) For any stationary and ergodic binary time
series $\{X_n\}$,

$$\lim_{t \to \infty} \hat P_{-t} = P(X_0 = 1 \mid X_{-\infty}^{-1}) \quad \text{almost surely.}$$
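Proof. We have

$$P_k - P(X_0 = 1 \mid X_{-\infty}^{-1})
= \frac{1}{k} \sum_{j=1}^{k} \left[ X_{-\tau_j} - P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) \right]
+ \frac{1}{k} \sum_{j=1}^{k} P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1}).$$

Observe that the first term is an average of a bounded martingale difference
sequence and by Azuma's exponential bound for bounded martingale
differences [4] we get that the first term tends to zero. Morvai showed in his
PhD thesis that

$$P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) = P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}).$$

This observation is the key to handling the second term:

$$\frac{1}{k} \sum_{j=1}^{k} P(X_{-\tau_j} = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1})
= \frac{1}{k} \sum_{j=1}^{k} P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}) - P(X_0 = 1 \mid X_{-\infty}^{-1}).$$

By the martingale convergence theorem,

$$P(X_0 = 1 \mid X_{-\lambda_{j-1}}^{-1}) \to P(X_0 = 1 \mid X_{-\infty}^{-1}) \quad \text{almost surely,}$$

and since ordinary convergence implies Cesaro convergence this completes
the proof of the theorem. □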

In this survey we will restrict ourselves to finite or countably valued
processes. Some of the directions that we survey have been generalized to real
valued processes and some even to processes taking values in more general
metric spaces. Some of the key papers in these directions are Algoet [1, 2, 3],
Morvai et al. [27, 26], Weiss [36] and Nobel [28].
We turn now to a brief description of the contents of our survey. In §2
we will describe some classes of processes that will play an important role
for us. Next §3 will contain a scheme for forward prediction at all $n$ which
can be shown to converge to the optimal prediction for the class of processes
with continuous conditional probabilities. This class includes of course $k$-step
Markov chains for any $k$.

In §4 we turn to a description of a sequence of stopping times together 
with estimators which converge along that sequence to the conditional prob- 
ability estimator for all processes. This sequence of stopping times grows 
rather quickly and we give a sequence with a slower growth rate but we 
can demonstrate the convergence only for processes whose conditional prob- 
abilities are almost surely continuous. Then in §5 for finitarily Markovian 
processes we give stopping times with an even slower growth rate. The fol- 
lowing section considers this class in more detail with respect to the problem 
of estimating the length of the memory word that occurs as the context at 
time n. 

We conclude with a series of constructions and examples in §§7-9 that
show the optimality of many of these results. Along the way several open
questions are mentioned since much remains to be done before we achieve a 
complete understanding of what is possible and what is not. 

2 Preliminaries - Classes of Stochastic Processes

Let $\mathcal{X}$ be a discrete (finite or countably infinite) alphabet. Let $\{X_n\}$ be a
stationary and ergodic time series.

For notational convenience let $p(x_{-k}^0)$ and $p(y \mid x_{-k}^0)$ denote the distribution
$P(X_{-k}^0 = x_{-k}^0)$ and the conditional distribution $P(X_1 = y \mid X_{-k}^0 = x_{-k}^0)$,
respectively.
Definition 1. For a stationary time series $\{X_n\}$ the (random) length $K(X_{-\infty}^0)$
of the memory of the sample path $X_{-\infty}^0$ is the smallest possible $0 \le K < \infty$
such that for all $i \ge 1$, all $y \in \mathcal{X}$, all $z_{-K-i+1}^{-K} \in \mathcal{X}^i$,

$$p(y \mid X_{-K+1}^0) = p(y \mid z_{-K-i+1}^{-K}, X_{-K+1}^0)$$

provided $p(z_{-K-i+1}^{-K}, X_{-K+1}^0, y) > 0$, and $K(X_{-\infty}^0) = \infty$ if there is no such $K$.

Note that we denote the random variables by capital letters and particular
realizations by lower case letters. For example, $p(y \mid X_{-k+1}^0)$ denotes the
random variable which is a function of the random variables $X_{-k+1}^0$ taking
the value $P(X_1 = y \mid X_{-k+1}^0 = x_{-k+1}^0)$ when $X_{-k+1}^0 = x_{-k+1}^0$.

Definition 2. The stationary time series $\{X_n\}$ is said to be finitarily Markovian
if $K(X_{-\infty}^0)$ is finite (though not necessarily bounded) almost surely.

This class includes of course all finite order Markov chains but also 
many other processes such as the finitarily determined processes of Kalikow, 
Katznelson and Weiss [13], which serve to represent all isomorphism classes of 
zero entropy processes. For some concrete examples that are not Markovian 
consider the following example: 

Example 1. Let $\{M_n\}$ be any stationary and ergodic first order Markov
chain with finite or countably infinite state space $S$. Let $s \in S$ be an arbitrary
state with $P(M_1 = s) > 0$. Now let $X_n = I_{\{M_n = s\}}$. By Shields ([35] Chapter
I.2.C.1), the binary time series $\{X_n\}$ is stationary and ergodic. It is also
finitarily Markovian. (Indeed, the conditional probability $P(X_1 = 1 \mid X_{-\infty}^0)$
does not depend on values beyond the first (going backwards) occurrence of
one in $X_{-\infty}^0$, which identifies the first (going backwards) occurrence of state $s$
in the Markov chain $\{M_n\}$.) The resulting time series $\{X_n\}$ is not a Markov
chain of any order in general. (Indeed, consider the Markov chain $\{M_n\}$ with
state space $S = \{0, 1, 2\}$ and transition probabilities $P(M_2 = 1 \mid M_1 = 0) =
P(M_2 = 2 \mid M_1 = 1) = 1$, $P(M_2 = 0 \mid M_1 = 2) = P(M_2 = 1 \mid M_1 = 2) = 0.5$.
This yields a stationary and ergodic Markov chain $\{M_n\}$, cf. Example 1.2.8
in Shields [35]. Clearly, the resulting time series $X_n = I_{\{M_n = 0\}}$ will not be
Markov of any order. The conditional probability $P(X_1 = 0 \mid X_{-\infty}^0)$ depends
on whether, until the first (going backwards) occurrence of one, you see an even
or odd number of zeros.) These examples include all stationary and ergodic
binary renewal processes with finite expected inter-arrival times, a basic class
for many applications. (A stationary and ergodic binary renewal process
is defined as a stationary and ergodic binary process such that the times
between occurrences of ones are independent and identically distributed with
finite expectation, cf. Chapter I.2.C.1 in Shields [35]).
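The parity effect in the three-state example can be checked with a small simulation. The sketch below is illustrative only: the function names are hypothetical, and for simplicity the chain is started in state 0 rather than from its stationary distribution. With an odd number of zeros since the last one the chain sits in state 1, so the next symbol is surely 0; with an even positive number it sits in state 2, so the next symbol is 1 with probability 0.5.

```python
import random


def simulate_indicator(n, seed=0):
    """Simulate X_t = 1 iff M_t = 0 for the chain 0 -> 1 -> 2 -> {0 or 1}."""
    rng = random.Random(seed)
    state = 0                        # assumption: start in state 0, not in stationarity
    xs = []
    for _ in range(n):
        xs.append(1 if state == 0 else 0)
        if state == 0:
            state = 1
        elif state == 1:
            state = 2
        else:
            state = rng.choice([0, 1])   # from state 2: probability 0.5 each
    return xs


def followers_by_parity(xs):
    """Tally [occurrences, times followed by a one] by parity of zeros since the last one."""
    tally = {"odd": [0, 0], "even": [0, 0]}
    zeros = None                     # None until the first one has been seen
    for current, nxt in zip(xs, xs[1:]):
        if current == 1:
            zeros = 0
        elif zeros is not None:
            zeros += 1
        if zeros:                    # skip None and 0 (no zeros since the last one)
            key = "odd" if zeros % 2 else "even"
            tally[key][0] += 1
            tally[key][1] += nxt
    return tally


tally = followers_by_parity(simulate_indicator(2000))
print(tally)   # the "odd" row never records a following one
```

No finite window of the past suffices to decide the parity, which is why the indicator process is not Markov of any fixed order even though its memory is almost surely finite.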

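Let $\mathcal{X}^{*-}$ be the set of all one-sided sequences, that is,

$$\mathcal{X}^{*-} = \{(\ldots, x_{-1}, x_0) : x_i \in \mathcal{X} \text{ for all } -\infty < i \le 0\}.$$

Let $f : \mathcal{X} \to (-\infty, \infty)$ be bounded, otherwise arbitrary. Define the function
$F : \mathcal{X}^{*-} \to (-\infty, \infty)$ as

$$F(x_{-\infty}^0) = E(f(X_1) \mid X_{-\infty}^0 = x_{-\infty}^0).$$

E.g. if $f(x) = 1_{\{x = z\}}$ for a fixed $z \in \mathcal{X}$ then $F(y_{-\infty}^0) = P(X_1 = z \mid X_{-\infty}^0 =
y_{-\infty}^0)$. If $\mathcal{X}$ is a countably infinite subset of the reals and $f(x) = x$ then
$F(y_{-\infty}^0) = E(X_1 \mid X_{-\infty}^0 = y_{-\infty}^0)$.

Define the distance $d^*(\cdot, \cdot)$ on $\mathcal{X}^{*-}$ as follows. For $x_{-\infty}^0, y_{-\infty}^0 \in \mathcal{X}^{*-}$ let

$$d^*(x_{-\infty}^0, y_{-\infty}^0) = \sum_{i=0}^{\infty} 2^{-i-1} 1_{\{x_{-i} \ne y_{-i}\}}.$$

Definition 2.1 We say that $F(X_{-\infty}^0)$ is continuous if a version of the function
$F(X_{-\infty}^0)$ on the whole set $\mathcal{X}^{*-}$ is continuous with respect to metric
$d^*(\cdot, \cdot)$.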
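To get a feel for the metric, the following sketch evaluates a truncated version of $d^*$ on two finite pasts. It assumes the weights $2^{-i-1}$ on coordinate $-i$ (so that the most recent symbol carries weight 1/2); the function name `d_star` and the list layout (index 0 holds $x_0$, index 1 holds $x_{-1}$, and so on) are hypothetical. Truncating after $m$ coordinates changes the value by at most $2^{-m}$.

```python
def d_star(x_past, y_past):
    """Truncated d*: x_past[i] plays the role of x_{-i} (most recent symbol first)."""
    return sum(2.0 ** (-(i + 1))
               for i, (a, b) in enumerate(zip(x_past, y_past))
               if a != b)


print(d_star([1, 0], [0, 0]))        # differ only at time 0 -> 0.5
print(d_star([0, 0, 1], [0, 0, 0]))  # differ only at time -2 -> 0.125
```

Two pasts are close in this metric exactly when they agree on a long stretch of the most recent symbols, so continuity of $F$ says that the distant past has a uniformly vanishing influence on the conditional expectation.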

As we have already mentioned any $k$-step Markov chain satisfies this, but
there are also many examples with unbounded memory. S. Kalikow showed
in [12] that the class can also be characterized as those processes which can
be constructed as random Markov chains. In this procedure, given a past
$X_{-\infty}^0$ one invokes an auxiliary independent process which chooses a random
memory length $K$ and then $X_1$ is chosen according to a fixed transition table
from $\mathcal{X}^K$ to $\mathcal{X}$.

Definition 2.2 We say that $F(X_{-\infty}^0)$ is almost surely continuous if for some
set $C \subseteq \mathcal{X}^{*-}$ which has probability one a version of the function $F(X_{-\infty}^0)$
restricted to this set $C$ is continuous with respect to metric $d^*(\cdot, \cdot)$.

This class is strictly larger than the processes with continuous conditional 
distributions. It contains many of the examples that have been used to 
demonstrate the limitations of universal schemes. In particular, it contains 
the class of finitary Markov processes where the usual continuity may not 
hold (cf. Morvai and Weiss [17]). 

3 Forward estimation for processes with continuous conditional distributions

For simplicity we will restrict our detailed presentation to the case where
$\{X_n\}$ is a stationary and ergodic binary time series. As we have remarked,
since we are interested primarily in pointwise results the restriction to ergodic
processes doesn't lead to any loss of generality, while the extension to finite
state processes is completely routine. Our goal is to estimate the conditional
probability $P(X_{n+1} = 1 \mid X_0^n)$ knowing only the samples $X_0^n$ but not the nature
of the process.

The following algorithm which was introduced in Morvai and Weiss [18]
has several nice features. For processes with continuous conditional
distribution the algorithm will almost surely give better and better prediction for
$X_{n+1}$ while for all other processes some type of convergence will obtain. For
$k \ge 1$ define the random variables $\tau_i^k(n)$ which indicate where the $k$-block
$X_{n-k+1}^n$ occurs previously in the time series $\{X_n\}$. Formally we set $\tau_0^k(n) = 0$
and for $i \ge 1$ let

$$\tau_i^k(n) = \min\{t > \tau_{i-1}^k(n) : X_{n-k+1-t}^{n-t} = X_{n-k+1}^n\}.$$

Let $K_n \ge 1$ and $J_n \ge 1$ be sequences of nondecreasing positive integers
tending to $\infty$ which will be fixed later.

Define $\kappa_n$ as the largest $1 \le k \le K_n$ such that there are at least $J_n$
occurrences of the block $X_{n-k+1}^n$ in the data segment $X_0^n$, that is,

$$\kappa_n = \max\{1 \le k \le K_n : \tau_{J_n}^k(n) \le n - k + 1\}$$

if there is such $k$ and $0$ otherwise.

Define $\lambda_n$ as the number of occurrences of the block $X_{n-\kappa_n+1}^n$ in the data
segment $X_0^n$, that is,

$$\lambda_n = \max\{1 \le j : \tau_j^{\kappa_n}(n) \le n - \kappa_n + 1\}$$

if $\kappa_n > 0$ and zero otherwise. Observe that if $\kappa_n > 0$ then $\lambda_n \ge J_n$.
Our estimate $g_n$ for $P(X_{n+1} = 1 \mid X_0^n)$ is defined as $g_0 = 0$ and for $n \ge 1$,

$$g_n = \frac{1}{\lambda_n} \sum_{i=1}^{\lambda_n} X_{n - \tau_i^{\kappa_n}(n) + 1}$$

if $\kappa_n > 0$ and zero otherwise.
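The estimator can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `forward_estimate` is hypothetical, the sample is a Python list holding $X_0, \ldots, X_n$, and the default block length and occurrence threshold assume a binary alphabet with $K_n = \max(1, \lfloor 0.1 \log_2 n \rfloor)$ and $J_n = \max(1, \lceil n^{0.5} \rceil)$. The sketch tries block lengths from $K_n$ downwards, which picks out the largest admissible $k$, and averages the symbols that followed the earlier occurrences of the current block.

```python
import math


def forward_estimate(x, K=None, J=None):
    """Estimate P(X_{n+1} = 1 | X_0^n) from the list x = [X_0, ..., X_n]."""
    x = list(x)
    n = len(x) - 1
    if n < 1:
        return 0.0                                   # g_0 = 0 by definition
    if K is None:
        K = max(1, math.floor(0.1 * math.log2(n)))   # assumed binary alphabet
    if J is None:
        J = max(1, math.ceil(math.sqrt(n)))
    for k in range(K, 0, -1):                        # largest admissible k first
        block = x[n - k + 1:n + 1]                   # X_{n-k+1}^n
        # symbols that followed each earlier occurrence (shift t) of the block
        followers = [x[n - t + 1]
                     for t in range(1, n - k + 2)
                     if x[n - k + 1 - t:n - t + 1] == block]
        if len(followers) >= J:                      # at least J_n occurrences
            return sum(followers) / len(followers)
    return 0.0                                       # kappa_n = 0


print(forward_estimate([1, 0, 1, 1, 1, 0, 1], K=1, J=1))   # 0.5
```

In the printed example the last symbol is 1; earlier ones at positions 0, 2, 3 and 4 were followed by 0, 1, 1 and 0, giving the average 0.5.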

Theorem (Morvai and Weiss [18]) Let $\{X_n\}$ be a stationary and ergodic time
series taking values from a finite alphabet $\mathcal{X}$. Assume $K_n = \max(1, \lfloor 0.1 \log_{|\mathcal{X}|} n \rfloor)$
and $J_n = \max(1, \lceil n^{0.5} \rceil)$. Then

(A) if the conditional expectation $P(X_1 = 1 \mid X_{-\infty}^0)$ is continuous with respect
to metric $d^*(\cdot, \cdot)$ then

$$\lim_{n \to \infty} \left| g_n - P(X_{n+1} = 1 \mid X_0^n) \right| = 0 \quad \text{almost surely,}$$

(B) without any continuity assumption,

$$\lim_{n \to \infty} \frac{1}{n} \sum_{i=0}^{n-1} \left| g_i - P(X_{i+1} = 1 \mid X_0^i) \right| = 0 \quad \text{almost surely,}$$

(C) without any continuity assumption, for arbitrary $\epsilon > 0$,

$$\lim_{n \to \infty} P\left( \left| g_n - P(X_{n+1} = 1 \mid X_0^n) \right| > \epsilon \right) = 0.$$



Remarks: 

We note that from the proofs of Ryabko [33] and Gyorfi, Morvai and Yakowitz [11]
it is clear that the continuity condition in the first part of the Theorem
cannot be relaxed. Even for the class of all stationary and ergodic binary
time series with merely almost surely continuous conditional probability
$P(X_1 = 1 \mid \ldots, X_{-1}, X_0)$ one cannot achieve the convergence as in part (A).

We do not know if the shifted version of our proposed scheme $g_n$ solves the
backward estimation problem or not. That is, in the case when $g_n$ is evaluated
on $(X_{-n}, \ldots, X_0)$ rather than on $(X_0, \ldots, X_n)$, we expect convergence to
hold for all processes but we have been unable to prove this.

It is known that when the algorithms of Ornstein [30], Algoet [1], and Morvai,
Yakowitz and Gyorfi [27] for the backward estimation problem are shifted
forward, parts (B) and (C) hold. For part (C) this is immediate from
stationarity while for part (B) it follows from a generalized ergodic theorem,
usually attributed to Breiman, but first proved by Maker [14]. Thus there is
no novelty in the existence of some scheme with these properties. However,
for the above algorithm all three properties hold. We should also point out
that if one knows that the process is $k$-step Markov for some fixed $k$ then
it is not very hard to see that the empirical distributions of the
$(k+1)$-blocks converge almost surely by the ergodic theorem, and this easily
forms the basis of a scheme which will succeed in the forward prediction of
these processes.
4 Estimating Along Stopping Times

The forward prediction problem for a binary time series $\{X_n\}_{n=0}^{\infty}$ is to
estimate the probability that $X_{n+1} = 1$ based on the observations $X_i$, $0 \le i \le n$,
without prior knowledge of the distribution of the process $\{X_n\}$. It is known
that this is not possible if one estimates at all values of $n$. Morvai [16]
presented a simple procedure which will attempt to make such a prediction
infinitely often at carefully selected stopping times chosen by the algorithm.
The growth rate of the stopping times can be determined. Here is his scheme.

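Let $\{X_n\}_{n=-\infty}^{\infty}$ denote a two-sided stationary and ergodic binary time
series. For $k = 1, 2, \ldots$, define the sequences $\{\tau_k\}$ and $\{\lambda_k\}$ recursively. Set
$\lambda_0 = 0$. Let

$$\tau_k = \min\{t > 0 : X_t^{\lambda_{k-1}+t} = X_0^{\lambda_{k-1}}\}$$

and

$$\lambda_k = \tau_k + \lambda_{k-1}.$$

(By stationarity, the string $X_0^{\lambda_{k-1}}$ must appear in the sequence $X_1^{\infty}$ almost
surely.) The $k$th estimate of $P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k})$ is denoted by $P_k$, and is
defined as

$$P_k = \frac{1}{k-1} \sum_{j=1}^{k-1} X_{\lambda_j+1}.$$

Theorem 4.1 (Morvai [16]) For all stationary and ergodic binary time
series $\{X_n\}$,

$$\lim_{k \to \infty} \left( P_k - P(X_{\lambda_k+1} = 1 \mid X_0^{\lambda_k}) \right) = 0 \quad \text{almost surely.}$$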

For some extensions of the algorithm see Morvai and Weiss [19]. 
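A small sketch may help to see how the stopping times of this scheme are produced from data. It is illustrative only: the function names are hypothetical, the sample is a Python list holding $X_0, \ldots, X_n$, and the estimate is assumed to average $X_{\lambda_j+1}$ over $j = 1, \ldots, k-1$. Each stage looks for the next reappearance of the whole prefix $X_0^{\lambda_{k-1}}$, so the matched pattern grows by the full recurrence time at every step.

```python
def stopping_times(x):
    """Return [lambda_0, lambda_1, ...] computable from x = [X_0, ..., X_n]."""
    x = list(x)
    lams = [0]                                   # lambda_0 = 0
    while True:
        lam = lams[-1]
        prefix = x[:lam + 1]                     # X_0^{lambda_{k-1}}
        # smallest t > 0 with X_t^{lambda_{k-1}+t} equal to the prefix
        tau = next((t for t in range(1, len(x) - lam)
                    if x[t:t + lam + 1] == prefix), None)
        if tau is None:                          # no recurrence inside the sample
            return lams
        lams.append(lam + tau)                   # lambda_k = tau_k + lambda_{k-1}


def estimate(x, lams, k):
    """P_k: average of X_{lambda_j + 1} for j = 1, ..., k-1 (requires k >= 2)."""
    return sum(x[lams[j] + 1] for j in range(1, k)) / (k - 1)


x = [0, 1] * 6
lams = stopping_times(x)
print(lams)                  # [0, 2, 4, 6, 8, 10]
print(estimate(x, lams, 5))  # 1.0
```

On a periodic sample the stopping times grow only linearly; for processes with positive entropy the recurrence of an ever longer prefix forces far faster growth.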

One of the drawbacks of this scheme is that the growth of the stopping times
$\{\lambda_k\}$ is rather rapid.

Theorem 4.2 (Morvai [16]) Let $\{X_n\}$ be a stationary and ergodic binary
time series. Suppose that $H > 0$ where

$$H = \lim_{n \to \infty} -\frac{1}{n+1} E \log_2 p(X_0, \ldots, X_n)$$

is the process entropy. Let $0 < \epsilon < H$ be arbitrary. Then for $k$ large enough,

$$\lambda_k(\omega) \ge c^{c^{\cdot^{\cdot^{c}}}} \quad \text{almost surely,}$$

where the height of the tower is $k - d$, $d(\omega)$ is a finite number which depends
on $\omega$, and $c = 2^{H - \epsilon}$.

Morvai and Weiss [17] exhibited an estimator which is consistent on a 
certain stopping time sequence for a restricted class of stationary time series 
but which has a much slower rate of growth. 

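Define the stopping times now as follows. Set $\zeta_0 = 0$. For $k = 1, 2, \ldots$, define
sequences $\eta_k$ and $\zeta_k$ recursively. Let

$$\eta_k = \min\{t > 0 : X_{\zeta_{k-1}-(k-1)+t}^{\zeta_{k-1}+t} = X_{\zeta_{k-1}-(k-1)}^{\zeta_{k-1}}\} \quad \text{and} \quad \zeta_k = \zeta_{k-1} + \eta_k.$$

One denotes the $k$th estimate of $P(X_{\zeta_k+1} = 1 \mid X_0^{\zeta_k})$ by $g_k$, and defines it to
be

$$g_k = \frac{1}{k} \sum_{j=0}^{k-1} X_{\zeta_j+1}.$$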
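The slower scheme differs from the previous one only in the pattern that must recur: at stage $k$ it is the block of length $k$ ending at the current stopping time rather than the entire prefix. The sketch below is illustrative, with hypothetical function names, a Python list holding $X_0, \ldots, X_n$, and the estimate taken to be the average of $X_{\zeta_j+1}$ over $j = 0, \ldots, k-1$.

```python
def zeta_times(x):
    """Return [zeta_0, zeta_1, ...] computable from x = [X_0, ..., X_n]."""
    x = list(x)
    zetas = [0]                                   # zeta_0 = 0
    while True:
        k = len(zetas)                            # stage being computed
        z = zetas[-1]
        start = z - (k - 1)                       # block X_{z-(k-1)}^{z} has length k
        block = x[start:z + 1]
        # eta_k: smallest t > 0 at which the block recurs, shifted forward by t
        eta = next((t for t in range(1, len(x) - z)
                    if x[start + t:z + t + 1] == block), None)
        if eta is None:                           # no recurrence inside the sample
            return zetas
        zetas.append(z + eta)                     # zeta_k = zeta_{k-1} + eta_k


def g(x, zetas, k):
    """g_k: average of X_{zeta_j + 1} for j = 0, ..., k-1 (requires k >= 1)."""
    return sum(x[zetas[j] + 1] for j in range(k)) / k


x = [1, 1, 0] * 4
zetas = zeta_times(x)
print(zetas)           # [0, 1, 4, 7, 10]
print(g(x, zetas, 4))  # 0.25
```

In this periodic sample the contexts at the later stopping times end in 1, 1, after which the next symbol is always 0; the early sample $X_1 = 1$ keeps the average at 0.25 for $k = 4$, and its weight decays like $1/k$.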

Theorem 4.3 (Morvai and Weiss [17]) Let $\{X_n\}$ be a stationary binary
time series. Then

$$\lim_{k \to \infty} \left| g_k - P(X_{\zeta_k+1} = 1 \mid X_0^{\zeta_k}) \right| = 0 \quad \text{almost surely}$$

provided that the conditional probability $P(X_1 = 1 \mid X_{-\infty}^0)$ is almost surely
continuous.

Remark. We note that for all stationary binary time series, the estimation
scheme described above is consistent in probability.

Next we will give some universal estimates for the growth rate of the stopping
times $\zeta_k$ in terms of the entropy rate of the process. This is natural since
the $\zeta_k$ are defined by recurrence times for blocks of length $k$, and these are
known to grow exponentially with the entropy rate.

Theorem 4.4 (Morvai and Weiss [17]) Let $\{X_n\}$ be a stationary and
ergodic binary time series. Then for arbitrary $\epsilon > 0$,

$$\zeta_k < 2^{k(H + \epsilon)} \quad \text{eventually almost surely,}$$

where $H$ denotes the entropy rate associated with time series $\{X_n\}$.

This upper bound is much more favourable than the lower bound in Mor- 
vai [16]. For some extensions of this algorithm see Morvai and Weiss [24]. 

5 Some Improvements for Finitarily Markovian Processes

Let $\{X_n\}_{n=-\infty}^{\infty}$ be a stationary and ergodic (not necessarily finitarily
Markovian) time series taking values from a discrete (finite or countably infinite)
alphabet $\mathcal{X}$. Morvai and Weiss [23] provided the following algorithm which
improves the performance of the previous one in case the process turns out
to be finitarily Markovian.

For $k \ge 1$, let $1 \le l_k \le k$ be a nondecreasing unbounded sequence of integers,
that is, $1 = l_1 \le l_2 \le \ldots$ and $\lim_{k \to \infty} l_k = \infty$.

Define auxiliary stopping times (similarly to Morvai and Weiss [17]) as
follows. Set $\zeta_0 = 0$. For $n = 1, 2, \ldots$, let

$$\zeta_n = \zeta_{n-1} + \min\{t > 0 : X_{\zeta_{n-1}-(l_n-1)+t}^{\zeta_{n-1}+t} = X_{\zeta_{n-1}-(l_n-1)}^{\zeta_{n-1}}\}.$$
Note that if $l_n = n$ then one recovers the stopping times of Morvai and
Weiss [17]. The point here is that $l_n$ may grow slowly.

Among other things, using $\zeta_n$ and $l_n$ we can define a very useful process
$\{\tilde X_n\}_{n=-\infty}^{0}$ as a function of $X_0^{\infty}$ as follows. Let $J(n) = \min\{j \ge 1 : l_{j+1} > n\}$
and define

$$\tilde X_{-i} = X_{\zeta_{J(i)} - i} \quad \text{for } i \ge 0.$$

In order to estimate $K(\tilde X_{-\infty}^0)$ we need to define some explicit statistics.
Define

$$\Delta_k = \sup_{1 \le i} \; \sup_{\{z_{-k-i+1}^{-k} \in \mathcal{X}^i,\, x \in \mathcal{X} \,:\, p(z_{-k-i+1}^{-k}, \tilde X_{-k+1}^0, x) > 0\}}
\left| p(x \mid \tilde X_{-k+1}^0) - p(x \mid (z_{-k-i+1}^{-k}, \tilde X_{-k+1}^0)) \right|.$$



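We will divide the data segment $X_0^n$ into two parts: $X_0^{\lceil n/2 \rceil - 1}$ and $X_{\lceil n/2 \rceil}^n$. Let
$\mathcal{L}_{n,k}^{(1)}$ denote the set of strings with length $k + 1$ which appear at all in $X_0^{\lceil n/2 \rceil - 1}$.
That is,

$$\mathcal{L}_{n,k}^{(1)} = \{x_{-k}^0 \in \mathcal{X}^{k+1} : \exists\, k \le t \le \lceil n/2 \rceil - 1 : X_{t-k}^t = x_{-k}^0\}.$$

For a fixed $0 < \gamma < 1$ let $\mathcal{L}_{n,k}^{(2)}$ denote the set of strings with length $k + 1$
which appear more than $n^{1-\gamma}$ times in $X_{\lceil n/2 \rceil}^n$. That is,

$$\mathcal{L}_{n,k}^{(2)} = \{x_{-k}^0 \in \mathcal{X}^{k+1} : \#\{\lceil n/2 \rceil + k \le t \le n : X_{t-k}^t = x_{-k}^0\} > n^{1-\gamma}\}.$$

Let

$$\mathcal{L}_k^n = \mathcal{L}_{n,k}^{(1)} \cap \mathcal{L}_{n,k}^{(2)}.$$

We define the empirical version of $\Delta_k$ as follows:

$$\hat\Delta_k^n = \max_{1 \le i \le n} \; \max_{(z_{-k-i+1}^{-k}, \tilde X_{-k+1}^0, x) \in \mathcal{L}_{k+i}^n} 1_{\{\zeta_{J(k)} \le \lceil n/2 \rceil - 1\}}
\left| \frac{\#\{\lceil n/2 \rceil + k \le t \le n : X_{t-k}^t = (\tilde X_{-k+1}^0, x)\}}{\#\{\lceil n/2 \rceil + k - 1 \le t \le n - 1 : X_{t-k+1}^t = \tilde X_{-k+1}^0\}}
- \frac{\#\{\lceil n/2 \rceil + k + i \le t \le n : X_{t-k-i}^t = (z_{-k-i+1}^{-k}, \tilde X_{-k+1}^0, x)\}}{\#\{\lceil n/2 \rceil + k + i - 1 \le t \le n - 1 : X_{t-k-i+1}^t = (z_{-k-i+1}^{-k}, \tilde X_{-k+1}^0)\}} \right|.$$

Note that the cut off $1_{\{\zeta_{J(k)} \le \lceil n/2 \rceil - 1\}}$ ensures that $\tilde X_{-k+1}^0$ is defined from $X_0^{\lceil n/2 \rceil - 1}$.
Observe that, by ergodicity, for any fixed $k$,

$$\liminf_{n \to \infty} \hat\Delta_k^n \ge \Delta_k \quad \text{almost surely.}$$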

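We define an estimate $\chi_n$ for $K(\tilde X_{-\infty}^0)$ from samples $X_0^n$ as follows. Let
$0 < \beta < \frac{1-\gamma}{2}$ be arbitrary. Set $\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest
$0 \le k_n < n$ such that $\hat\Delta_{k_n}^n \le n^{-\beta}$.
Observe that if $\zeta_j \le \lceil n/2 \rceil - 1 < \zeta_{j+1}$ then $\chi_n \le l_{j+1}$.

Here the idea is that if $K(\tilde X_{-\infty}^0) < \infty$ then $\chi_n$ will be equal to $K(\tilde X_{-\infty}^0)$
eventually and if $K(\tilde X_{-\infty}^0) = \infty$ then $\chi_n \to \infty$.

Now we define the sequence of stopping times $\lambda_n$ along which we will be able
to estimate. Set $\lambda_0 = \zeta_0$, and for $n \ge 1$ if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then put

$$\lambda_n = \min\{t > \lambda_{n-1} : X_{t-\chi_t+1}^t = X_{\zeta_j-\chi_t+1}^{\zeta_j}\}$$

and

$$\kappa_n = \chi_{\lambda_n}.$$

Observe that if $\zeta_j \le \lambda_{n-1} < \zeta_{j+1}$ then $\zeta_j \le \lambda_{n-1} < \lambda_n \le \zeta_{j+1}$. If $\chi_{\lambda_{n-1}+1} = 0$
then $\lambda_n = \lambda_{n-1} + 1$. Note that $\lambda_n$ is a stopping time and $\kappa_n$ is our estimate
for $K(\tilde X_{-\infty}^0)$ from samples $X_0^{\lambda_n}$.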

Let $f : \mathcal{X} \to (-\infty, \infty)$ be bounded. One denotes the $n$th estimate of
$E(f(X_{\lambda_n+1}) \mid X_0^{\lambda_n})$ from samples $X_0^{\lambda_n}$ by $f_n$, and defines it to be

$$f_n = \frac{1}{n} \sum_{j=0}^{n-1} f(X_{\lambda_j+1}).$$

Fix positive real numbers $0 < \beta, \gamma < 1$ such that $2\beta + \gamma < 1$, fix a sequence $l_n$
such that $1 = l_1 \le l_2 \le \ldots$, $l_n \to \infty$, and fix a bounded function $f(\cdot) : \mathcal{X} \to (-\infty, \infty)$,
and with these numbers, sequence and function define $\zeta_n$, $\chi_n$, $\kappa_n$, $\lambda_n$ and
$F(\cdot)$ as described above. For the resulting $f_n$ we have the
following theorem:



Theorem 5.1 (Morvai and Weiss [23]) Let $\{X_n\}$ be a stationary and
ergodic time series taking values from a finite or countably infinite set $\mathcal{X}$. If
the conditional expectation $F(X_{-\infty}^0)$ is almost surely continuous then almost
surely,

$$\lim_{n \to \infty} f_n = F(\tilde X_{-\infty}^0) \quad \text{and} \quad \lim_{n \to \infty} \left| f_n - E(f(X_{\lambda_n+1}) \mid X_0^{\lambda_n}) \right| = 0.$$



For arbitrary 5 > 0, < e 2 < €\, let l n = min (n, max^l, L^frfj l°g2 n j)) ■ 
Then 

eventually almost surely, and the upper bound is a polynomial whenever the 
stationary and ergodic time series {X n } has finite entropy rate H . 
If the stationary and ergodic time series $\{X_n\}$ turns out to be finitarily
Markovian then

$$\lim_{n \to \infty} \frac{\lambda_n}{n} = \frac{1}{p(\tilde X_{-K+1}^0)} < \infty \quad \text{almost surely.}$$

Moreover, if the stationary and ergodic time series $\{X_n\}$ turns out to be
independent and identically distributed then $\lambda_n = \lambda_{n-1} + 1$ eventually almost
surely.


6 Estimation for Finitarily Markovian Processes

In this section we broaden the scope of the estimation question that we will
discuss and describe first how well we can detect the presence of a memory
word in a finitarily Markovian process (cf. Morvai and Weiss [25]). This
problem has been discussed often in the context of modelling processes. Here
we will show how it relates to prediction questions.

Recall that $K$ was the minimal length of the context that defines the
conditional probability. We take up the problem of estimating the value of
$K$, both in the backward sense and in the forward sense, where one observes
successive values of $\{X_n\}$ for $n \ge 0$ and asks for the least value $K$ such
that the conditional distribution of $X_{n+1}$ given $\{X_i\}_{i=n-K+1}^{n}$ is the same as
the conditional distribution of $X_{n+1}$ given $\{X_i\}_{i=-\infty}^{n}$. We will consider both
finite and countably infinite alphabet size.

For the case of finite alphabet finite order Markov chains similar questions
have been studied by Bühlmann and Wyner in [6]. However, the fact that we
want to treat countable alphabets complicates matters significantly. The 
point is that while finite alphabet Markov chains have exponential rates of 
convergence of empirical distributions, for countable alphabet Markov chains 
no universal rates are available at all. 

This problem appears in Morvai and Weiss [21] where a universal estima- 
tor for the order of a Markov chain on a countable state space is given, and 
some of the techniques that are used in the proofs of the results described 
here have their origin in that paper. We note in passing, that in Morvai and 
Weiss [20] it is shown that there is no classification rule for discriminating 
the class of finitarily Markovian processes from other ergodic processes. 

The key notion is that of a memory word which can be defined as 
follows. 

Definition 6.1 We say that $w_{-k+1}^0$ is a memory word if for all $i \ge 1$, all
$y \in \mathcal{X}$, all $z_{-k-i+1}^{-k} \in \mathcal{X}^i$,

$$p(y \mid w_{-k+1}^0) = p(y \mid z_{-k-i+1}^{-k}, w_{-k+1}^0)$$

provided $p(z_{-k-i+1}^{-k}, w_{-k+1}^0, y) > 0$.

Define the set $\mathcal{W}_k$ of those memory words $w_{-k+1}^0$ with length $k$, that is,

$$\mathcal{W}_k = \{w_{-k+1}^0 \in \mathcal{X}^k : w_{-k+1}^0 \text{ is a memory word}\}.$$

Our first result is a solution of the backward estimation problem, namely
determining the value of $K(X_{-\infty}^0)$ from observations of increasing length of
the data segments $X_{-n}^0$. We will give in the next subsection a universal
consistent estimator which will converge almost surely to the memory length
$K(X_{-\infty}^0)$ for any ergodic finitarily Markovian process on a countable state
space. The detailed proofs in Morvai and Weiss [25] are quite explicit and,
given some information on the average length of a memory word and the
extent to which the stationary distribution diffuses over the state space, one
could extract rates for the convergence of the estimators. We concentrate,
however, on the more universal aspects of the problem.

As is usual in these kinds of questions, the problem of forward estimation,
namely trying to determine $K(X_{-\infty}^n)$ from successive observations of $X_0^n$,
is more difficult. The stationarity means that results in probability can
be carried over automatically. However, almost sure results present serious
problems as we have already said. For some more results in this circle of
ideas of what can be learned about processes by forward observations see
Ornstein and Weiss [32], Dembo and Peres [9], Nobel [29], and Csiszar and
Talata [8].

Recently in Csiszar and Talata [8] the authors define a finite context to
be a memory word $w$ of minimal length, that is, no proper suffix of $w$ is a
memory word. An infinite context for a process is an infinite string with all
finite suffixes having positive probability but none of them being a memory
word. They treat there the problem of estimating the entire context tree in
case the size of the alphabet is finite. For a bounded depth context tree,
the process is Markovian, while for an unbounded depth context tree the
universal pointwise consistency result there is obtained only for the truncated
trees which are again finite in size. This is in contrast to our results which
deal with infinite alphabet size and consistency in estimating memory words
of arbitrary length. This is what forces us to consider estimating at specially
chosen times.

In the second subsection we will present a scheme which depends upon a
positive parameter $\epsilon$, and we guarantee that the set of times along which the
estimates are being given has density at least $1 - \epsilon$. The last two subsections
are devoted to seeing how this memory length estimation can be applied to
estimating conditional probabilities. We do this first for finitarily Markovian
processes along a sequence of stopping times which achieve density $1 - \epsilon$.
We do not know if the $\epsilon$ can be dropped in this case for the estimation of
conditional probabilities.

We can dispense with $\epsilon$ in the Markovian case. For this we use an earlier
result of ours on a universal estimator for the order of a finite order
Markov chain on a countable alphabet in order to estimate the conditional
probabilities along a sequence of stopping times of density one.



6.1 Backward Estimation of the Memory Length for 
Finitarily Markovian Processes 

Let $\{X_n\}$ be stationary and ergodic finitarily Markovian with finite or
countably infinite alphabet.

In order to estimate $K(X_{-\infty}^0)$ we need to define some explicit statistics. The
first is a measurement of the failure of $w_{-k+1}^0$ to be a memory word.

Define

$$\Delta_k(w_{-k+1}^0) = \sup_{1 \le i} \; \sup_{\{z_{-k-i+1}^{-k} \in \mathcal{X}^i,\, x \in \mathcal{X} \,:\, p(z_{-k-i+1}^{-k}, w_{-k+1}^0, x) > 0\}}
\left| p(x \mid w_{-k+1}^0) - p(x \mid z_{-k-i+1}^{-k}, w_{-k+1}^0) \right|.$$

Clearly this will vanish precisely when $w_{-k+1}^0$ is a memory word. We need to
define an empirical version of this based on the observation of a finite data
segment $X_{-n}^0$. To this end first define the empirical version of the conditional
probability as

$$\hat p_n(x \mid w_{-k+1}^0) = \frac{\#\{-n + k - 1 \le t \le -1 : X_{t-k+1}^{t+1} = (w_{-k+1}^0, x)\}}{\#\{-n + k - 1 \le t \le -1 : X_{t-k+1}^{t} = w_{-k+1}^0\}}.$$

These empirical distributions, as well as the sets we are about to introduce,
are functions of $X_{-n}^0$, but we suppress the dependence to keep the notation
manageable.

For a fixed < 7 < 1 let C k denote the set of strings with length fc + l which 
appear more than n 1-7 times in X°_ n . That is, 

CI = {x\ e X k+1 : #{-n + k<t<0:XL k = x°_ k } > n 1 ^}. 

Finally, define the empirical version of A k as follows: 

Afc(^Vl) = 3<¥ , ^ aX Pn{x\w°_ k+1 ) -p n (x\zZ k _ i+1 ,W°_ k+1 ) 

Let us agree by convention that if the smallest of the sets over which
we are maximizing is empty then $\hat{\Delta}_k^n = 0$. Observe that, by the
ergodic theorem, almost surely the empirical distributions $\hat{p}_n$
converge to the true distributions $p$, and so for any $w_{-k+1}^{0} \in \mathcal{X}^k$,

$$\liminf_{n \to \infty} \hat{\Delta}_k^n(w_{-k+1}^{0}) \ge \Delta_k(w_{-k+1}^{0}) \quad \text{almost surely.}$$
With this in hand we can give a test for $w_{-k+1}^{0}$ to be a memory word. Let
$0 < \beta < \frac{1-\gamma}{2}$ be arbitrary. Let $NTEST_n(w_{-k+1}^{0}) = YES$ if $\hat{\Delta}_k^n(w_{-k+1}^{0}) \le n^{-\beta}$
and NO otherwise. Note that $NTEST_n$ depends on $X_{-n}^{0}$.
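
The following Python sketch shows one way the empirical discrepancy and the resulting test could be computed. It is an illustration under stated assumptions, not the authors' implementation: the names `cond_table`, `delta_hat` and `ntest` are hypothetical, the data segment is a list with the most recent symbol last, and the default values `gamma=0.5`, `beta=0.2` are arbitrary choices. The inner maximum runs over frequent strings that contain the candidate word immediately before their last symbol, and the loop stops early because every extension of an infrequent string is itself infrequent.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple

Word = Tuple[Hashable, ...]


def cond_table(data: Sequence[Hashable], length: int) -> Dict[Word, Tuple[int, float]]:
    """Map each window s of `length` symbols to (count, p_hat(s[-1] | s[:-1]))."""
    counts = Counter(
        tuple(data[s:s + length]) for s in range(len(data) - length + 1)
    )
    context_totals: Counter = Counter()
    for window, c in counts.items():
        context_totals[window[:-1]] += c
    return {w: (c, c / context_totals[w[:-1]]) for w, c in counts.items()}


def delta_hat(data: Sequence[Hashable], word: Sequence[Hashable], gamma: float) -> float:
    """Largest gap between p_hat(x | word) and p_hat(x | z, word) over frequent (z, word, x)."""
    word = tuple(word)
    k, n = len(word), len(data) - 1
    threshold = n ** (1.0 - gamma)
    base = cond_table(data, k + 1)
    best = 0.0  # convention: zero when there is nothing to maximize over
    for i in range(1, n + 1):
        if k + i + 1 > len(data):
            break
        frequent = [
            (s, prob)
            for s, (c, prob) in cond_table(data, k + i + 1).items()
            if c > threshold and s[i:i + k] == word
        ]
        if not frequent:
            break  # longer extensions cannot be frequent either
        for s, prob in frequent:
            best = max(best, abs(base[word + (s[-1],)][1] - prob))
    return best


def ntest(data: Sequence[Hashable], word: Sequence[Hashable],
          gamma: float = 0.5, beta: float = 0.2) -> bool:
    """True (YES) when the discrepancy is at most n**(-beta); needs n >= 1."""
    n = len(data) - 1
    return delta_hat(data, word, gamma) <= n ** (-beta)
```

On the periodic sequence 0, 0, 1, 1, 0, 0, 1, 1, ... the single symbol 0 is rejected (the symbol before it changes the prediction), whereas the pair 0, 0 is accepted.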

Theorem 6.1 (Morvai and Weiss [25]) Eventually almost surely, $NTEST_n(w_{-k+1}^{0}) =$
YES if and only if $w_{-k+1}^{0}$ is a memory word.

We define an estimate $\chi_n$ for $K(X_{-\infty}^{0})$ from samples $X_{-n}^{0}$ as follows. Set
$\chi_0 = 0$, and for $n \ge 1$ let $\chi_n$ be the smallest $0 \le k < n$ such that
$NTEST_n(X_{-k+1}^{0}) = YES$ if there is such a $k$, and $n$ otherwise.

Theorem 6.2 (Morvai and Weiss [25]) $\chi_n = K(X_{-\infty}^{0})$ eventually almost
surely.
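
The backward estimate amounts to a linear search over suffix lengths. In the sketch below the memory-word test is passed in as a callable, so the code does not commit to any particular implementation of the test; the names `backward_memory_estimate` and `is_memory_word` are hypothetical.

```python
from typing import Callable, Hashable, Sequence, Tuple


def backward_memory_estimate(
    data: Sequence[Hashable],
    is_memory_word: Callable[[Tuple[Hashable, ...]], bool],
) -> int:
    """Smallest k < n whose length-k suffix of `data` passes the test, else n."""
    n = len(data) - 1  # data holds n + 1 observations, most recent last
    for k in range(n):  # k = 0, 1, ..., n - 1
        suffix = tuple(data[len(data) - k:])  # empty tuple when k = 0
        if is_memory_word(suffix):
            return k
    return n
```

With a single observation (n = 0) the loop is empty and the function returns 0, in line with the convention that the estimate is 0 at time 0.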

6.2 Forward Estimation of the Memory Length for
Finitarily Markovian Processes

Let $\{X_n\}$ be a stationary and ergodic finitarily Markovian process with finite or
countably infinite alphabet.

Define $PTEST_n(w_{-k+1}^{0})(X_0^n) = NTEST_n(w_{-k+1}^{0})(T^n X_0^n)$, where $T$ is the left
shift operator.

Theorem 6.3 (Morvai and Weiss [25]) Eventually almost surely, $PTEST_n(w_{-k+1}^{0}) =$
YES if and only if $w_{-k+1}^{0}$ is a memory word.

Define a list of words $\{w(0), w(1), w(2), \ldots, w(n), \ldots\}$ such that all words of
all lengths are listed and a word cannot precede its suffix. Note that $w(0)$
is the empty word.
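
For a finite alphabet, one ordering with the required property is to list words by increasing length, since every proper suffix of a word is strictly shorter than the word. The generator below is a sketch under that finiteness assumption (for a countably infinite alphabet this particular ordering would never get past the words of length one, so a different enumeration would be needed); the name `words_in_suffix_safe_order` is hypothetical.

```python
from itertools import count, product
from typing import Hashable, Iterator, Sequence, Tuple


def words_in_suffix_safe_order(
    alphabet: Sequence[Hashable],
) -> Iterator[Tuple[Hashable, ...]]:
    """Yield every finite word over a finite alphabet, shortest words first."""
    for length in count(0):  # length 0 yields the empty word first
        for word in product(alphabet, repeat=length):
            yield word
```

For the binary alphabet the list begins with the empty word, then 0, 1, 00, 01, 10, 11, and so on.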

Now define sets of indices $A_n^i$ as follows. Let $A_n^0 = \{0, 1, \ldots, n\}$ and for $i > 0$
define

$$A_n^i = \{|w(i)| - 1 \le j \le n : X_{j-|w(i)|+1}^{j} = w(i)\}. \qquad (1)$$
Let $\epsilon > 0$ be fixed. Define $\theta_n(\epsilon) < n$ to be the minimal $j$ such that

$$\frac{\left| \bigcup_{i \le j :\, PTEST_n(w(i)) = YES} A_n^i \right|}{n+1} \ge 1 - \epsilon \qquad (2)$$

and $n$ otherwise. We estimate the length of the memory of $X_{-\infty}^{n}$ looking
backwards if $n \in \bigcup_{i \le \theta_n(\epsilon),\, PTEST_n(w(i)) = YES} A_n^i$. The set of $n$'s for which this
holds will be the set for which we estimate the memory and we denote this
set by $\mathcal{N}$. Note that the event $n \in \mathcal{N}$ depends only on $X_0^n$, and thus $\mathcal{N}$ can
be thought of as a sequence of stopping times.
We define for $n \in \mathcal{N}$,

$$\kappa_n = \min\{i \ge 0 : X_{n-|w(i)|+1}^{n} = w(i),\ PTEST_n(w(i)) = YES\}.$$

For $n \in \mathcal{N}$ define

$$\rho_n(X_0^n) = |w(\kappa_n)|.$$

Note that $\rho_n$, $\theta_n$, $\kappa_n$ and $\mathcal{N}$ depend on $\epsilon$; however, we will not denote this
dependence explicitly.
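
The construction of this subsection can be sketched in Python as follows. The code walks through the word list, accumulates the positions covered by accepted words, stops once the covered fraction reaches one minus epsilon (or after the word with index n), and reports the length of the first accepted word that ends at time n. It is an illustration only: `forward_memory_estimate`, `shortlex_words` and `ends_with` are hypothetical names, the memory-word test is supplied by the caller as `ptest`, and the word list is the length-ordered enumeration for a finite alphabet.

```python
from itertools import count, islice, product
from typing import Callable, Hashable, Iterable, Iterator, Optional, Sequence, Tuple

Word = Tuple[Hashable, ...]


def shortlex_words(alphabet: Sequence[Hashable]) -> Iterator[Word]:
    """All words over a finite alphabet, shortest first (suffixes come earlier)."""
    for length in count(0):
        yield from product(alphabet, repeat=length)


def ends_with(data: Sequence[Hashable], j: int, word: Word) -> bool:
    """True if `word` occupies positions j - len(word) + 1, ..., j of data."""
    k = len(word)
    return j - k + 1 >= 0 and tuple(data[j - k + 1:j + 1]) == word


def forward_memory_estimate(
    data: Sequence[Hashable],
    words: Iterable[Word],
    ptest: Callable[[Word], bool],
    epsilon: float,
) -> Optional[int]:
    """Return the memory-length estimate if time n is in the stopping set, else None."""
    n = len(data) - 1
    covered = set()  # union of the index sets of accepted words so far
    rho = None       # length of the first accepted word ending at time n
    for j, word in enumerate(islice(words, n + 1)):  # w(0), ..., w(n)
        if ptest(word):
            covered.update(t for t in range(n + 1) if ends_with(data, t, word))
            if rho is None and ends_with(data, n, word):
                rho = len(word)
        if j < n and len(covered) / (n + 1) >= 1 - epsilon:
            break  # j is the cut-off index theta_n
    return rho
```

A return value of `None` means that no estimate is attempted at this time, which is how the sketch represents times outside the stopping set.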
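
Theorem 6.4 (Morvai and Weiss [25]) Let $\epsilon > 0$ be fixed. Then for $n \in \mathcal{N}$,

$$\rho_n = K(X_{-\infty}^{n}) \quad \text{eventually almost surely,} \qquad (3)$$

and

$$\liminf_{n \to \infty} \frac{|\mathcal{N} \cap \{0, 1, \ldots, n-1\}|}{n} \ge 1 - \epsilon. \qquad (4)$$

For $n \in \mathcal{N}$, $X_{n-\rho_n+1}^{n}$ appears at least $n^{1-\gamma}$ times eventually almost surely.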

6.3 Forward Estimation of the Conditional Probability 
for Finitarily Markovian Processes 

Let $\{X_n\}$ be a stationary and ergodic finitarily Markovian process with finite or
countably infinite alphabet. Now our goal is to estimate the conditional probability
$P(X_{n+1} = x \mid X_0^n)$ on stopping times in a pointwise sense.
Let $\mathcal{N}$ be a sequence of stopping times such that eventually almost surely
$X_{n-K(X_{-\infty}^{n})+1}^{n}$ appears at least $n^{1-\gamma}$ times in $X_0^n$.

Let $\rho_n$ be any estimate of the length of the memory from samples $X_0^n$ such
that $\rho_n - K(X_{-\infty}^{n}) \to 0$ on $\mathcal{N}$.

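Define our estimate $\hat{q}_n(x)$ of the conditional probability $P(X_{n+1} = x \mid X_0^n)$ on
$\mathcal{N}$ as

$$\hat{q}_n(x) = \frac{\#\{\rho_n - 1 \le i < n : X_{i-\rho_n+1}^{i} = X_{n-\rho_n+1}^{n},\ X_{i+1} = x\}}{\#\{\rho_n - 1 \le i < n : X_{i-\rho_n+1}^{i} = X_{n-\rho_n+1}^{n}\}}.$$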
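
The estimator counts how often the most recent block of the estimated memory length occurred earlier in the sample and how often it was then followed by the symbol of interest. A minimal Python sketch, with the hypothetical name `forward_cond_prob` and the memory-length estimate passed in as `rho`, could look like this; it reads the event in the numerator as "the symbol following the earlier occurrence equals x", since the symbol at time n + 1 has not yet been observed.

```python
from typing import Hashable, Optional, Sequence


def forward_cond_prob(
    data: Sequence[Hashable], rho: int, x: Hashable
) -> Optional[float]:
    """Relative frequency of `x` after earlier occurrences of the last `rho` symbols."""
    n = len(data) - 1
    context = tuple(data[n - rho + 1:n + 1])  # the block ending at time n
    matches = 0
    hits = 0
    for i in range(rho - 1, n):  # rho - 1 <= i < n, so data[i + 1] exists
        if tuple(data[i - rho + 1:i + 1]) == context:
            matches += 1
            if data[i + 1] == x:
                hits += 1
    return hits / matches if matches else None
```

With `rho = 0` the context is empty and the estimate reduces to the relative frequency of `x` in the sample.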

Theorem 6.5 (Morvai and Weiss [25]) On $n \in \mathcal{N}$,

$$|\hat{q}_n(x) - P(X_{n+1} = x \mid X_0^n)| \to 0 \quad \text{almost surely.}$$

Corollary 6.1 For the stopping times $\mathcal{N}$ and estimator $\rho_n$ in Theorem 6.4,
Theorem 6.5 holds and the density of $\mathcal{N}$ is at least $1 - \epsilon$.

6.4 Forward Estimation of the Conditional Probability
for Markov Processes



Let $\{X_n\}$ be a stationary and ergodic Markov chain of order $K$ on a finite or
countably infinite alphabet. Let $ORDEST_n$ be an estimator of the order
from samples $X_0^n$ such that $ORDEST_n \to K$ almost surely. Such an estimator
can be found e.g. in Morvai and Weiss [21]. Let $n \in \mathcal{N}$ if $X_{n-ORDEST_n+1}^{n}$
appears at least $n^{1-\gamma}$ times in $X_0^n$. Then $\mathcal{N}$ is a sequence of stopping times. Let

$$\hat{q}_n(x) = \frac{\#\{ORDEST_n - 1 \le i < n : X_{i-ORDEST_n+1}^{i} = X_{n-ORDEST_n+1}^{n},\ X_{i+1} = x\}}{\#\{ORDEST_n - 1 \le i < n : X_{i-ORDEST_n+1}^{i} = X_{n-ORDEST_n+1}^{n}\}}.$$
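
Whether a given time belongs to the set of stopping times is decided from the sample alone: one checks how often the most recent block, of length equal to the current order estimate, has appeared. The sketch below illustrates this check; the name `in_stopping_set` is hypothetical, the order estimate is supplied by the caller rather than computed, and the occurrence ending at time n itself is counted.

```python
from typing import Hashable, Sequence


def in_stopping_set(
    data: Sequence[Hashable], order_estimate: int, gamma: float
) -> bool:
    """True if the last `order_estimate` symbols occur at least n**(1 - gamma) times."""
    n = len(data) - 1
    k = order_estimate
    if k > len(data):
        return False  # the block does not even fit into the sample
    context = tuple(data[len(data) - k:])  # empty tuple when k = 0
    occurrences = sum(
        1
        for j in range(max(k - 1, 0), n + 1)
        if tuple(data[j - k + 1:j + 1]) == context
    )
    return occurrences >= n ** (1.0 - gamma)
```

On times that pass this check, the conditional probability would then be estimated by counting successors of the same block.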

Theorem 6.6 (Morvai and Weiss [25]) Assume $ORDEST_n$ equals the order
eventually almost surely. Then on $n \in \mathcal{N}$,

$$|\hat{q}_n(x) - P(X_{n+1} = x \mid X_{n-K+1}^{n})| \to 0 \quad \text{almost surely,}$$

and

$$\liminf_{n \to \infty} \frac{|\mathcal{N} \cap \{0, 1, \ldots, n-1\}|}{n} = 1.$$

If the Markov chain turns out to take values from a finite set, then $\mathcal{N}$ takes
as values all but finitely many positive integers.



7 Examples Illustrating Limitations 

For the class of all stationary and ergodic binary Markov-chains of some finite 
order the forward estimation problem can be solved. Indeed, if the time series 
is a Markov-chain of some finite order, we can estimate the order and count 
frequencies of blocks with length equal to the order. Bailey showed that one
cannot test for membership in this class; cf. also Morvai and Weiss [20].

It is conceivable that one could improve the result of Morvai [16] or Morvai
and Weiss [17] so that if the process happens to be Markovian then one
eventually estimates at all times. However, it has been shown in Morvai and Weiss
[22] that this is not possible. This puts some new restrictions on what can
be achieved in estimating along stopping times.

Theorem 7.1 (Morvai and Weiss [22]) For any strictly increasing sequence
of stopping times $\{\lambda_n\}$ such that for all stationary and ergodic binary Markov-
chains with arbitrary finite order, eventually $\lambda_{n+1} = \lambda_n + 1$, and for any
sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$ there is a stationary and ergodic
binary time series $\{X_n\}$ with almost surely continuous conditional probability
$P(X_1 = 1 \mid \ldots, X_0)$, such that

$$P\left( \limsup_{n \to \infty} \left| h_n(X_0, \ldots, X_{\lambda_n}) - P(X_{\lambda_n + 1} = 1 \mid X_0, \ldots, X_{\lambda_n}) \right| > 0 \right) > 0.$$

Remark: Bailey [5] among other things proved that there is no sequence
of functions $\{e_n(X_0^{n-1})\}$ which for all stationary and ergodic time series, if
it turns out to be a Markov-chain, would be eventually 1 and 0 otherwise.
(That is, there is no test for the Markov property.) This result does not imply
ours. On the other hand, our result implies Bailey's. (Indeed, if there were
a test for Markov-chains in the above sense, we could apply the estimator in
Morvai [16] or Morvai and Weiss [17] if the time series is not a Markov-chain
of some finite order, and if the time series is a Markov-chain of some finite
order we can estimate the order of the Markov chain and count frequencies
of blocks with length equal to the order.)

Bailey [5] and Ryabko [33] proved less than our theorem. They proved
the nonexistence of the desired estimator when the estimator should work
for all stationary and ergodic binary time series and when all $\lambda_n = n$, that
is, when we always require good prediction.

8 Memory Estimation for Markov Processes 

In this section we shall examine how well one can estimate the local memory
length for finite order Markov chains. In the case of finite alphabets this can
be done with stopping times that eventually cover all time epochs. (Indeed,
assume $\{X_n\}$ is a Markov chain taking values from a finite set. Assume
$ORDEST_n$ estimates the order in a pointwise sense from data $X_0^n$. Then let

$$\rho_n = \min\{0 \le t \le ORDEST_n : PTEST_n(X_{n-t+1}^{n}) = YES\}$$

if there is such $t$ and 0 otherwise. Since $ORDEST_n$ eventually gives the right
order and there are finitely many possible strings with length not greater
than the order, $\rho_n = K(X_{-\infty}^{n})$ eventually almost surely by Theorem 6.3.)

However, as soon as one goes to a countable alphabet, even if the order
is known to be two and we are just trying to decide whether $X_n$ alone is
a memory word or not, there is no sequence of stopping times which is
guaranteed to succeed eventually and whose density is one, cf. Morvai and Weiss
[25]. This shows that the $\epsilon$ in the preceding sections cannot be eliminated.

Theorem 8.1 (Morvai and Weiss [25]) There is no strictly increasing
sequence of stopping times $\{\lambda_n\}$ and sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$ taking
the values one and two, such that for all countable alphabet Markov chains
of order two:

$$\lim_{n \to \infty} \frac{\lambda_n}{n} = 1$$

and

$$\lim_{n \to \infty} |h_n(X_0, \ldots, X_{\lambda_n}) - K(X_0^{\lambda_n})| = 0 \quad \text{with probability one.}$$

9 Limitations for Binary Finitarily Markovian
Processes

In the preceding section we showed that we cannot achieve density one in
the forward memory length estimation problem even in the class of Markov
chains on a countable alphabet. In this section we shall show something
similar in the class of binary (i.e. $\{0, 1\}$-valued) finitarily Markovian processes.
We will assume that there is given a sequence of estimators and stopping
times $(h_n, \lambda_n)$ that succeed in estimating the memory length
for binary Markov chains of finite order, and construct a finitarily Markovian
binary process on which the scheme fails infinitely often. Here is a precise
statement:

Theorem 9.1 (Morvai and Weiss [25]) For any strictly increasing sequence
of stopping times $\{\lambda_n\}$ and sequence of estimators $\{h_n(X_0, \ldots, X_{\lambda_n})\}$,
such that for all stationary and ergodic binary Markov chains with arbitrary
finite order, $\lim_{n \to \infty} \frac{\lambda_n}{n} = 1$, and

$$\lim_{n \to \infty} |h_n(X_0, \ldots, X_{\lambda_n}) - K(X_0^{\lambda_n})| = 0 \quad \text{almost surely,}$$

there is a stationary, ergodic finitarily Markovian binary time series such
that on a set of positive measure of process realizations

$$h_n(X_0, \ldots, X_{\lambda_n}) \ne K(X_{-\infty}^{\lambda_n})$$

infinitely often.

In the final process $\{X_n\}$ that we constructed in Morvai and Weiss [25],
$P(K(X_{-\infty}^{0}) = k)$ decays to zero exponentially fast and in particular is
summable. It follows that with probability one eventually $K(X_0^n) < n$, so
that our failure to estimate the order correctly does not come
about because we do not even see the memory word.

It is also worth pointing out that the set of moments on which the estimator
fails has density zero. It follows fairly easily from the ergodic
theorem that if one is willing to tolerate such failures then a straightforward
application of any backward estimation scheme will converge outside a set of
density zero.